Predicting survival of glioblastoma from automatic whole-brain and tumor segmentation of MR images

Survival prediction models can potentially be used to guide treatment of glioblastoma patients. However, currently available MR imaging biomarkers holding prognostic information are often challenging to interpret, have difficulties generalizing across data acquisitions, or are only applicable to pre-operative MR data. In this paper we aim to address these issues by introducing novel imaging features that can be automatically computed from MR images and fed into machine learning models to predict patient survival. The features we propose have a direct anatomical–functional interpretation: They measure the deformation caused by the tumor on the surrounding brain structures, comparing the shape of various structures in the patient’s brain to their expected shape in healthy individuals. To obtain the required segmentations, we use an automatic method that is contrast-adaptive and robust to missing modalities, making the features generalizable across scanners and imaging protocols. Since the features we propose do not depend on characteristics of the tumor region itself, they are also applicable to post-operative images, which have been much less studied in the context of survival prediction. Using experiments involving both pre- and post-operative data, we show that the proposed features carry prognostic value in terms of overall- and progression-free survival, over and above that of conventional non-imaging features.

www.nature.com/scientificreports/ interpretability of features is important: If a model cannot give interpretable explanations of its predictions, clinicians may not trust the model enough to factor its predictions into their decisions, even if the model is accurate 21 . Interpretable models may also uncover patterns in the data that give valuable new insight into the disease, and inspire future research. • Difficulties generalizing: The reproducibility of studies using radiomics has been shown to be less than ideal, with results failing to generalize well across scanners and software implementations [22][23][24][25] . Since many radiomic features depend directly on raw image intensities, they are sensitive to subtle changes in scanning equipment and image acquisition parameters. Furthermore, both textural and shape features depend on the segmentation mask that is used 26 , underlining the importance of using image segmentation methods that are robust with respect to such sources of variation. • Focus on pre-operative data: Compared to pre-operative images of glioblastoma, relatively little attention has been given to radiomics and other biomarkers in post-operative images. The reason may be that postoperatively, tumor shape and textural features are less easily detectable, as a large part of the tumor is usually removed. Nevertheless, post-operative images are collected closer to the time of disease progression and contain information about the success of operation, making them important to consider in a survival model. While the volume of tumor in post-operative images has been shown to correlate with OS 27,28 , more advanced imaging biomarkers in post-operative images remain mostly unexplored.
In this paper, we propose a method that aims to start addressing these shortcomings. Rather than focusing on in-region radiomic features of the tumor itself, we look at out-of-region features that are more straightforward to interpret and that can readily be applied both to pre-and post-operative data. For this purpose, we take advantage of a recently proposed method to robustly segment dozens of neuroanatomical structures in the presence of tumors 29 . Because this method aims to be invariant to imaging variations, it can be directly applied to data acquired at different centers with different scanners and protocols. We demonstrate the resulting survival prediction method on two fundamentally different datasets: one pre-and one post-operative dataset, each acquired with different scanners, MR contrasts, and pre-processing workflows. Our results show that the proposed features improve the performance of survival models for both overall-and progression-free survival, compared to models based only on several previously known prognostic factors. To the best of our knowledge, this is the first time a survival model for glioblastoma has been proposed that characterizes a detailed segmentation of the surrounding brain , although related registration-based work exists [30][31][32] .

Methods
The method we propose for survival prediction consists of three steps, illustrated in Fig. 1. The first step is to segment the images with a contrast-adaptive whole-brain segmentation method, simultaneously segmenting dozens of brain structures and the tumor. In the second step, features are computed by comparing each segmented structure to its expected healthy shape using the 95% Hausdorff distance. In the third step, the extracted features are fed into a feature selector and a survival prediction model.

Image segmentation.
For segmentation we use a method that we recently developed 29 , in which three tumor components (edema, enhancing core, and non-enhancing core) and dozens of neuroanatomical structures are automatically delineated from a patient's brain MR scan. For post-operative scans, another component is added to capture resection cavity. The method builds on a tool for whole-brain segmentation called Sequence Adaptive Multimodal SEGmentation (SAMSEG), which is distributed with the open-source software suite FreeSurfer 33 . It robustly segments head MR scans without any form of preprocessing, using an algorithm that can analyze multimodal data and adapt to variations in contrast due to differences in acquisition hardware or pulse sequences 34 .
SAMSEG is centered on a probabilistic atlas that encodes the spatially varying voxel-wise prior probability of 41 different structures in an average-shaped head 35 . This atlas is augmented with a deformation model warping it to match the anatomy of individual subjects, along with models of the MR bias field and of structure-specific intensity profiles. At segmentation time, these models are fitted to the image being segmented and then used to compute an automatic segmentation (Fig. 1).
For the purpose of segmenting scans with brain tumors, the basic SAMSEG model is further augmented with a spatial regularization model of tumor shape using generative neural networks 29 . Although in its original formulation we used convolutional restricted Boltzmann machines 36 for this purpose, our current implementation has variational autoencoders 37 since these have a deeper structure and can therefore better represent lesion shape 38 .
Feature extraction. Once segmentations are available, we aim to extract features that can sensitively measure the effect the brain tumor has on the shape of the various neuroanatomical structures, compared to those seen in healthy individuals ( Fig. 1 (Step 2)). To facilitate comparisons between individuals, we compute the features in atlas space, i.e., we warp the automatic segmentations back onto the average-shaped head model by applying the deformation fields that were estimated as part of the segmentation process. The resulting warped, subject-specific segmentations can then be compared to an "average" head segmentation that does not take any intensity information into account, obtained by assigning each voxel to the structure with the highest probability in the atlas. We will refer to this "average" head segmentation as the atlas segmentation. In healthy individuals, the subject-specific warped segmentations will be fairly close to the atlas segmentation in non-cortical structures after warping into atlas space, whereas in brain tumor patients the difference will often be much larger. www.nature.com/scientificreports/ In order to quantitatively compare the two segmentations, we compute a robust version of the Hausdorff distance 39 for each of 26 relevant structures: Accumbens area (L &R), amygdala (L &R), brain stem, caudate (L &R), cerebellum cortex (L &R) , cerebral cortex (L &R), hippocampus (L &R), lateral ventricle (L &R), optic chiasm, pallidum (L &R), putamen (L &R), thalamus (L &R), ventral diencephalon (L &R), 3rd-and 4th-ventricles. The Hausdorff distance measures the distance between the outer borders of a pair of segmentation masks; its robust version is an often-used metric to quantify the performance of automatic segmentation methods with respect to manual "ground truth" delineations performed by human experts 40 . Let A and B denote the outer border of the segmentation masks of a particular brain structure, obtained from the atlas and warped segmentation, respectively. The Hausdorff distance computes, for all voxels on the border A, the shortest Euclidean distance to voxels on the border B, and vice versa, and returns the maximum value over all the computed distances. Because the maximum distance is highly sensitive to outliers, the robust version instead returns the 95th percentile of the distances (Fig. 2). The robust version is often called the 95% Hausdorff distance but for short, will be referred to as Hd95 throughout the rest of the paper.  www.nature.com/scientificreports/ In cases where no voxel is assigned to a structure when obtaining the automatic segmentation, the Hd95 is not defined. In such cases, we instead use a single voxel located at the center of mass of the corresponding atlas segmentation.
For an example of how Hd95 captures the deformation of brain structures, Fig. 3 shows two subjects with glioblastoma ( Fig. 3B,C) and the corresponding atlas segmentation ( The proposed Hd95 features contain some information about where the tumor is located in the brain and its size, both of which have been studied before and shown to carry prognostic value 6,9,28,[41][42][43][44] . To verify that any prognostic value of our features is not solely based on tumor size and location, in our experiments we also evaluate the performance of our survival prediction models when they are trained directly on the estimated tumor size and the center-of-mass (CoM) coordinates of the whole tumor (defined as the set of voxels assigned to any tumor component). The contrast-enhancing tumor volume (CEV) is the tumor size definition most widely used clinically, but we will also consider the volume of each tumor component (TCV), including resection cavity in case of post-operative images , the extent of which has previously been shown to be an important independent prognostic factor 45 .
Survival prediction. Survival predictions models were trained following a standard machine learning workflow. The workflow consists of feature selection to remove uninformative features, and subsequent fitting of a survival prediction model to the remaining features (see Fig. 1

(Step 3)).
For feature selection, we used the univariate Cox proportional hazards (Cox PH) model 46 (implemented in python 47 ), considering one feature at a time and retaining it if its coefficient is significantly nonzero. We used two-sided P values and considered P < 0.05 statistically significant.
A random survival forest (RSF) 48 was used as the prediction model (implemented in python 49 ). RSF extends the random forest model 50 to handle right-censored data, i.e., subjects who had not yet died by the end of the study-knowing that these subjects survived at least until their recorded time still contributes to fitting the RSF parameters. The RSF is an ensemble of trees whose leaf nodes estimate the subject's survival function from training data seen by the node. The survival prediction for a subject is taken as the expected survival of the average survival function across all leaf nodes the subject visits. Due to the small number of subjects in our datasets, we did not optimize over the RSF hyperparameters but left them at the default setting in the survival analysis software: 100 trees, no maximum depth, 6 subjects minimum to split a node, and minimum 3 subjects in leaf nodes. Models were trained via K-fold cross-validation where K was chosen such that in each fold, 5 subjects were left out while the model was trained on the remaining N-5 subjects (K= N/5 ); the model was then used to predict survival of the 5 left-out subjects. We repeated this procedure 100 times for more accurate estimation of model performance.

Experiments and results
To demonstrate the versatility and reproducibility of the proposed method across data acquisitions, we performed experiments on two fundamentally different datasets: an in-house dataset of post-operative scans, and a publicly accessible dataset of pre-operative scans. Here we first describe these datasets, and subsequently present results for each.  www.nature.com/scientificreports/ written informed consent since the study operated exclusively on anonymized retrospective data. This study was performed according to the Declaration of Helsinki and Danish legislation.

Datasets. Copenhagen dataset (post-operative).
Our primary focus is on a set of post-operative scans acquired at Rigshospitalet, Copenhagen. It contains MR scans of 146 histologically verified glioblastoma patients, diagnosed in the period September 2011-April 2014 using the WHO standard at the time 2 . Using the 2021 WHO standard 1 retrospectively, the distribution of IDH mutation types in these subjects is as follows: GBM, IDH wild type, WHO grade IV (n = 119); Astrocytoma, IDH-mutant, WHO grade IV (previously Glioblastoma, IDH-mutant) (n = 9); and GBM, NOS (Not Otherwise Specified) (n = 19). Each patient received radiation therapy with concomitant and adjuvant temozolomide (see 6 for details about treatment). OS and PFS were recorded in months for all subjects with 14 and 6 censored subjects (i.e. still alive/non-progression at the end of the study), respectively. MR scans were acquired for radiation planning 2-3 weeks post-operatively. The acquired MR modalities included 3D T1 (MPRAGE) post-administration of gadolinium contrast (T1c), T2, and FLAIR ( Fig. 4A-C), using a 1.5T Siemens Espree scanner. The T1c scans were acquired using a voxel size of 0.5 × 0.5 × 1.0 mm 3 (matrix size 384 × 512 × 176 ); the FLAIR scans with a voxel size of 0.45 × 0.45 × 3.3 mm 3 (matrix size 448 × 512 × 40 ); and the T2 scans using a voxel size of 0.3 × 0.3 × 3.3 mm 3 (matrix size 672 × 768 × 39 ). As the only form of preprocessing, intra-subject registration and resampling to 1 mm 3 resolution was performed using FLIRT 51 . Three out of the 146 subjects were excluded as their post-operative MR data was unavailable. Out of the remaining 143 subjects, 11 were missing FLAIR scans and 3 were missing T2. However, our segmentation algorithm is robust with respect to missing modalities, allowing all 143 subjects to be included in the study.
Additional features recorded in the clinic were the patient's age, performance status, and MGMT protein status. As mentioned in the introduction, these are features that have been previously shown to have prognostic value and are thus commonly considered for radiotherapy planning. We will refer to this set of variables as the "clinical features".
BraTS20 dataset (pre-operative). To test the reproducibility of the methods we propose, we also applied them to a fundamentally different (namely, pre-operative) dataset, obtained with other acquisition settings and pre- www.nature.com/scientificreports/ processing steps. The Multi-modal Brain Tumor Segmentation Challenge 2020 (BraTS20) 52-55 released a publicly available set of 235 high-grade glioma subjects with overall-survival times. This dataset contains both glioblastoma and anaplastic astrocytoma 40 , although more detailed information on the subjects' sub-classification is not provided. For each subject, information on their age and OS is provided, but PFS or other clinical features are not available. None of the 235 subjects are censored. The MR scans originate from multiple clinics and were acquired on different scanners, with magnetic field strengths of 1.5T and 3T. For each subject, the dataset contains a T1 pre-and post-administration of gadolinium contrast, a T2, and a T2 FLAIR scan (Fig. 5A-D). In a pre-processing step, the images were aligned to a brain template, interpolated to 1 mm 3 isotropic resolution, and skull-stripped by the challenge organizers 40,52 . Despite differences in available MR contrasts and in pre-processing compared to the Copenhagen dataset, our segmentation method did not need adjustment to handle the BraTS20 data (see in Fig. 5E-G, the manual and automatic tumor segmentation along with the whole-brain segmentation).

Results on the Copenhagen dataset.
We present several different aspects of the proposed prediction method. First, we look at which Hd95 features were automatically selected for inclusion in the survival models. We then make a comparison between models trained on different feature sets, and we test the proposed method's ability to stratify patients into high-and low-risk groups based on their predictions. Finally, we evaluate the discriminative power of individual features for predicting short and long survival. In order to demonstrate the robustness of the proposed method to potential inaccuracies in the automatic segmentations, we also include simulation experiments in which the Hd95 distances are artificially enlarged as supplementary material.
Feature selection. Rather than reporting on Hd95 features selected within each fold during cross-validation, for conciseness here we present results of the Cox PH feature selection method on the entire cohort. Although this introduces information leakage between training and test sets, in our experiments we found that the selected features across folds were highly consistent, thus having minimal impact on the overall prediction performance (details provided in supplementary file). As shown in Table 1, our feature selection resulted in 10 retained Hd95 features for OS prediction, and 4 for PFS.
Subject-level prediction performance. To evaluate the prognostic value of the Hd95 features, in this section we investigate the performance of RSF prediction models trained on different sets of input features. In particular, we are interested in the comparison of models trained with the clinical features alone; the Hd95 features alone; and the combination of both. In addition, we compare with models that include tumor size (either TCV or CEV) and center-of-mass (CoM) as input features, as well as with models that only use age as the clinical variable. Note that feature selection was only performed on the Hd95 features as the clinical, size and location features have all been previously shown to carry prognostic value 6,9,28,[41][42][43][44] .
The first two columns of Table 2 show performance of the RSF models, computed from the cross-validated predictions on the Copenhagen dataset. To quantify the performance of any given model, its predictions were www.nature.com/scientificreports/ compared with the ground truth survival times using Harrell's concordance index (C-index) 56 . The C-index computes the probability that for a pair of randomly selected subjects, their predicted survival is correctly ordered with respect to their true survival times. A C-index value of 1 means perfect prediction performance while 0.5 is the expected result of blindly guessing. The reported C-index is the average over the 100 repetitions of crossvalidation, accompanied by the 95% confidence interval of the mean in brackets. The best model for OS was achieved by combining the proposed features with the previously known prognostic clinical features: further addition of CoM, TCV, and CEV did not provide significant improvement. Individually, the clinical, size and location features all showed lower performance than the Hd95 features for OS prediction, and when combined they achieved only 0.624 C-index, compared to the 0.670 C-index when Hd95 was also included. The Hd95 features thus seem to bring prognostic value that is not contained in simple size and location based features.
For PFS, the best model was achieved by combination of Hd95, clinical, CoM and CEV, achieving a C-index of 0.637. Individually, the CoM was the best predictor of PFS, and combining it with clinical and size features achieved a C-index of 0.622. The benefit of including the Hd95 features is clear for PFS but is considerably lower than for OS. Table 1. Brain structures whose Hd95 feature was selected by the feature selection method are marked with a checkmark, accompanied by L and R denoting left and right-sided structures. Shown for both the Copenhagen and BraTS20 datasets.   www.nature.com/scientificreports/ Risk group stratification. Here we demonstrate that the proposed survival models can be used to stratify patients into low-and high-risk groups. For this purpose, a threshold was selected by searching, among the predictions for all Copenhagen patients, for the value that best separates the dataset in terms of the recorded survival 19,57 . Separation quality was measured with the log-rank test 58 , which tests the hypothesis that two groups have the same survival distribution. The prediction value yielding the lowest P-value (of the log-rank test) was chosen as the threshold separating the low-from the high-risk patients. Visualization of the resulting groups, using the RSF models trained on the combination of clinical and selected Hd95 features as prediction models, is shown with Kaplan-Meier survival curves 59 for OS and PFS in Fig. 6A,B, respectively. We further computed the corresponding hazard ratio for the obtained splits (ratio of hazard rates between the two groups under the proportional hazards assumption 60 ), using the univariate Cox proportional hazards model where the input covariate was the group membership. In addition to the hazard ratio, its 95% confidence interval and log-rank P-value were also computed. For OS, the hazard ratio was 2.65 (1.85-3.79), P = 10 −8 and for PFS the hazard ratio was 1.85 (1.7-4.78), P = 10 −5 . These results show that our survival models can stratify patients into significantly different survival groups for both OS and PFS.
Prognostic potential of individual features. The Hd95 features we propose have a clear anatomical-functional interpretation: higher values reflect more severe deformation in the corresponding brain structures. To test the intuition that highly deformed individual structures are associated with poor outcomes, we concentrated on subjects with very high deformations and tested to what degree their survival differs from that of the remaining subjects. Specifically, for each of the 26 brain structures for which we computed Hd95 features, we split the subjects into two groups according to whether or not they are in the highest 10% range of feature values. We then computed (1) the percentage of short survivors (below the median survival of the cohort) among the subjects in the highest 10% range, and (2) the log-rank test between the two groups.
The results of this experiment are listed in Table 3 for structures where the log-rank P-value was significant. The results show that for several brain structures, high Hd95 value is a strong predictor of short survival. The best predictor of OS was deformation of the left lateral ventricle, where 92% of the subjects with the most deformation were short survivors. For PFS, the best predictor was the deformation of the left thalamus, with 91% of the subjects with the most deformation of that structure being short survivors.  Table 3. Percentage of short survivors among the subjects in the highest 10% range of individual Hd95 feature values. The table also shows the P-value of a log-rank test between the survival times of subjects within and outside the highest 10% range. Brain structures where the log-rank P-value > 0.05 are omitted. www.nature.com/scientificreports/ Results on the BraTS20 dataset. The same methods were applied to the BraTS20 (pre-operative) dataset, for which our goal was to predict the OS only.
Feature selection. Feature selection on the full BraTS20 cohort resulted in selection of 4 Hd95 features: left putamen, left pallidum, left hippocampus and the left amygdala (listed in Table 1 for comparison with Copenhagen dataset). Similar to the Copenhagen dataset results, selecting features within each fold of cross-validation resulted in mostly the same features being chosen (see details in Supplementary File). Note that three of the selected features on the BraTS20 data were also selected for OS prediction on the Copenhagen dataset (vs. two for PFS, cf. Table 1).
Subject-level prediction performance. Model comparison to test whether the Hd95 features contain prognostic information not included in the clinical, size or location data was done in the same manner as with the Copenhagen dataset. An important difference is that the only clinical data available here is the subject's age, while the Copenhagen data also included MGMT and performance status. As shown in the last column of Table 2, the best OS prediction model was obtained with a combination of Hd95, CEV, and age. This is largely in line with the results we obtained for OS prediction on the Copenhagen data (cf. first column of Table 2), where the best models were the ones combining Hd95 with other features. The results for size and location features are similar between the datasets: neither are good OS predictors individually. However, individually, here the age was the best feature, achieving a C-index of 0.581, which is substantially higher than in the Copenhagen dataset where age alone only achieved 0.509. Although the performance of the proposed Hd95 features and CEV individually was quite low (0.571 and 0.534, respectively), combining them both with the age achieved the best C-index of 0.631. While one of the best models for OS prediction on the Copenhagen dataset was the model combining Hd95 with clinical features, that specific combination only achieved a C-index of 0.612 on the BraTS20 dataset. Nevertheless, this is still an improvement over considering either of the two feature sets individually. It is further worth noting that age is the only clinical feature provided in the BraTS20 dataset-addition of MGMT and performance status information could improve the performance and possibly outperform the model using Hd95, CEV, and age also here.
Risk group stratification. As in the Copenhagen dataset, the prediction model trained on the combination of clinical and selected Hd95 features was used to stratify the BRaTS20 cohort. The Kaplan-Meier curves in Fig. 6C show the proportion of subjects alive at any given time point for the two resulting groups. The corresponding hazard ratio was 2.81 (1.84-4.29) and log-rank P-value 5 × 10 −7 , indicating that the two resulting groups have significantly different OS.

Prognostic potential of individual features.
To explore to what degree individual Hd95 features can predict short survival in the BraTS20 dataset, we repeated the experiment of exploring the percentage of short survivors among the subjects with the most deformed brain structures. As shown in Table 3, the highest 10% range of feature values is predictive of short survival for several structures. Compared to our results on the Copenhagen dataset, two new structures show high predictive power in the BraTS20 data.

Discussion
In this paper we have proposed a new set of imaging features for glioblastoma survival prediction. Our main goal was to introduce imaging features that are more interpretable and that can be replicated across different MR contrasts, scanning equipment, or preprocessing. The proposed Hd95 features can be interpreted as measuring the deviation from normal brain morphology due to glioblastoma, and are computed by comparing an automatic whole-brain segmentation with its expected equivalent in healthy subjects. To achieve robustness to missing MR modalities and variations in scanners or acquisition protocols, the automatic segmentations were obtained with a method that was designed to have these properties.
Using experiments on two different datasets-one post-operative and one pre-operative-we showed that the proposed features carry prognostic information and can improve survival models that use conventional clinical features such as age, MGMT, and performance status. Group analysis based on the output of our models showed that they could clearly stratify the datasets into low-and high-risk groups with significantly different survival characteristics. Furthermore, individual feature predictiveness was explored, indicating that for some brain structures, very high deformation is a reliable indicator of short survival.
Through feature selection, we discovered several brain structures whose Hd95 value correlates with survival and were therefore retained for training our prediction models. Although the same set of structures was not selected in each case (OS vs. PFS and pre-vs. post-operative), two structures were selected in all three cases: the left pallidum and left putamen. Interestingly, we found that right-sided structures were overall less associated with survival. Two recent studies have explored the association of OS with left-sided glioblastoma (having higher volume in the left hemisphere than the right side) but with contrary results 43,44 . One study 43 , which showed the association of left-sidedness and worse prognosis, proposed a possible explanation could be that the left hemisphere's functions may be more essential for survival.
In glioma surgery the term "minimal common brain" refers to a probabilistic atlas of primarily centrally located areas of the brain that house the core of connectivity of the main associative pathways, where functional deficits cannot be compensated through functional plasticity and are, thus, considered non-resectable 61 . In our analysis, the most salient brain structures (Table 3) are all located centrally in the brain and are overlapping with "minimal common brain". The biological mechanism that couples structural deformation to OS can be speculated www.nature.com/scientificreports/ to arise from functional disruption in these critically important structures and/or their pathways, that cannot be compensated for by functional reorganization leading to an early decrease in clinical performance status and death. The functional disruption could arise from mechanically induced pressure caused by mass effects, disruption of blood supply, and direct tumor infiltration. By automatically computing detailed segmentations of both the tumor and the surrounding brain, the method presented here makes it easy to analyze the relative role of both aspects on survival prediction, as we have done in our experiments (Table 2). Nevertheless, answering long-standing questions such as whether patients with tumors presenting more infiltration than mass-effect have a better prognosis, remains difficult. This is because FLAIR MRI signal changes are unspecific and cannot be interpreted as identical with infiltration: Pre-treatment these changes can represent tumor infiltration or edema, and post-treatment additionally ischemia, demyelination and gliosis. Furthermore, prognosis needs to be interpreted within specific tumor groups, which is challenged by continuous revisions of pathological classifications, last in 2021. For GBM, classified by 2007 standards, a progression pattern without contrast enhancement had a better OS than progression with, likely because of the hypoxic and toxic environment associated with the contrast enhancing area 62 . In terms of recurrence pattern after radiation therapy, only 6% of GBM patients had distant recurrences, indicating poor local control close to the original contrast enhancement to be the primarily cause, and not tumor distant infiltration 63 . GBM patients with a longer progression free survival, e.g., MGMT-methylated patients, are more prone to non-central failures. Thus, treatment failure in contrast enhancing areas is the primary reason for a poor GBM prognosis while infiltration mainly influences OS in long survivor patients.
As demonstrated in our experiments, the features proposed in this paper readily generalize across datasets: They are independent of scanner and imaging parameters, and they can be computed from both pre-or postoperative images; from data that is skull-stripped or not; and from subjects with missing modalities. We did, however, see worse prognostic performance of the Hd95 features for OS prediction on the pre-operative cohort (BraTS20) compared to the post-operative one (Copenhagen). One possible reason for this discrepancy may be that the BraTS20 dataset contains both glioblastoma and anaplastic astrocytomas, which have different survival characteristics. The fact that BraTS20 is pre-operative may play an important role as well, as the effects of surgery can not be taken into account.
Although the extent of resection (EOR) has been shown to be an important independent prognostic factor 45 , EOR was not recorded in our post-operative dataset and was therefore not used in our experiments. However, we did consider the volume of each tumor component, including both the contrast-enhancing tumor volume and resection cavity in case of post-operative images (cf. Table 2). It has been argued by the response assessment in neuro-oncology (RANO) working group 64 that EOR does not represent the biological reality of the tumor burden, and the use of post-op residual tumor volumes is recommended as the measurement of cytoreductive treatment effect, as used e.g., in the fluorescence-guided surgery trials with 5-aminolevulinic acid (5-ALA) 65 . Since a 90% resection of a 1 mL and a 100 mL GBM is not the same thing, we have placed emphasis on the absolute and not relative measure of tumor remnants in our experiments.
The proposed Hd95 features measure how much each brain structure is deformed compared to its expected shape in the absence of pathology, and therefore they contain information about the location of the tumor, which has been shown previously to be a prognostic factor for OS 9,28,[41][42][43][44] . Nevertheless, our results show that the proposed Hd95 features carry richer prognostic information for predicting OS than tumor location alone. For predicting PFS, tumor location was found to be a stronger predictor than Hd95 when considering these feature sets individually; however, substantially higher model performance was achieved with a combination of the two, together with clinical and size features. To the best of our knowledge, considerations of tumor location has not been a parameter used in the stratification of patients to treatment in clinical glioblastoma trials, although the poor prognostic feature is recognized in clinical management. Based on our results, the application of survival models exploiting advanced imaging features, such as the ones proposed here, could potentially help minimize bias in stratification in future clinical trials. High quality prognostic information could also potentially guide clinicians in adjusting the intensity of interventions, based on expected outcome and quality-of-life considerations.
Since the submission of this paper, independent research that also uses peri-tumoral deformation for survival prediction has been published elsewhere 32 . In that work, three tumor compartments are manually delineated from pre-operative T1c, T2 and FLAIR scans; the T1c scan with the tumor masked out 66 is nonlinearly deformed to a template; and features summarizing the properties of the resulting deformation field are then computed (five first-order statistics of the deformation field magnitude in each of twelve equidistant bands drawn around the tumor boundary). In contrast to this registration-based approach centered around manual tumor delineations, the method proposed here is segmentation-based and fully automatic: Not only does it perform detailed segmentations of both the tumor region and the surrounding brain directly from multi-contrast MRI scans (including post-operative ones), the method also computes its predictions from the shape of individual neuroanatomical structures instead. Such segmentation-based features are arguably easier to interpret than the features used in 32 , which closely resemble classical histogram radiomics features (in casu mean, median, standard deviation, skewness, and kurtosis of deformation field amplitudes in 5mm thick bands around the tumor).
While radiomics studies focus on patterns within the tumor region, in this study we have focused on the rest of the brain and ignored the tumor region itself entirely. Using such an approach, we demonstrated that considering out-of-region deformation features together with conventional clinical prognostic factors significantly improves survival models. A recent study 19 showed how 18 radiomic features could similarly improve RSF model accuracy when combined with clinical features. Future work may therefore explore combining both withintumor radiomic features and our Hd95 features to further improve model accuracy, following the example of 32 .